1 Project summary and objective

1.1 Research Question

Our interest lies in direct marketing campaigns. We want to measure how effective these campaigns are and to predict whether a customer will subscribe to a term deposit as a result of one. We also want to predict how many campaign contacts it takes before a customer subscribes. Finally, we want to learn whether other attributes, such as job, age, balance and loan, affect whether a customer subscribes.

1.2 Project Objective

Build models to predict whether a customer will subscribe to a term deposit. Once the models are built, we look for patterns that identify customers who are more likely to subscribe. The bank can then concentrate its staff on these target customers instead of spending effort on unlikely prospects, which should help raise the success rate of its marketing campaigns.

2 Dataset summary

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The file used here, bank-full.csv, consists of 45,211 observations with 16 input attributes and one output variable. The inputs cover bank client data, data related to the last contact of the current campaign, and other attributes such as the history of previous campaigns.

2.1 Input variables

2.1.1 7 Numeric variables:

age: the customer's age
balance: the customer's balance
duration: last contact duration, in seconds.
campaign: number of contacts performed during this campaign and for this client.
pdays: number of days that passed after the client was last contacted in a previous campaign (-1 means the client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
day: last contact day of the month

2.1.2 9 Categorical variables:

job : type of job
marital: marital status
education: education level
default: has credit in default?
housing: has a housing loan?
loan: has a personal loan?
contact: contact communication type
month: last contact month of year
poutcome: outcome of the previous marketing campaign

2.1.3 Output variables

y - has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)

3 Univariate Analysis summary tables of measures and categories

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
summarize_numeric = function(dataset) {
  
  dataset = select_if(dataset, is.numeric)
  summary.table = data.frame(Attribute = names(dataset))
  
  summary.table = summary.table %>% 
    mutate('Missing Values' = apply(dataset, 2, function (x) sum(is.na(x))),
           'Unique Values' = apply(dataset, 2, function (x) length(unique(x))),
           'Mean' = colMeans(dataset, na.rm = TRUE),
           'Min' = apply(dataset, 2, function (x) min(x, na.rm = TRUE)),
           'Max' = apply(dataset, 2, function (x) max(x, na.rm = TRUE)),
           'SD' = apply(dataset, 2, function (x) sd(x, na.rm = TRUE))
    )
  summary.table
}

summarize_character = function(dataset) {
  
  dataset = select_if(dataset, is.character)
  summary.table = data.frame(Attribute = names(dataset))
  
  summary.table = summary.table %>% 
    mutate('Missing Values' = apply(dataset, 2, function (x) sum(is.na(x))),
           'Unique Values' = apply(dataset, 2, function (x) length(unique(x))),
    )
  summary.table
}

Give a summary view of the data.

bank = read_csv('bank-full.csv',show_col_types = FALSE)
sc_bank <- summarize_character(bank)
sn_bank<- summarize_numeric(bank) %>% mutate_if(is.numeric, round, digits = 2)

library(knitr)

knitr::kable(sn_bank,"simple")
Attribute   Missing Values   Unique Values      Mean     Min      Max        SD
---------   --------------   -------------   -------   -----   ------   -------
age                      0              77     40.94      18       95     10.62
balance                  0            7168   1362.27   -8019   102127   3044.77
day                      0              31     15.81       1       31      8.32
duration                 0            1573    258.16       0     4918    257.53
campaign                 0              48      2.76       1       63      3.10
pdays                    0             559     40.20      -1      871    100.13
previous                 0              41      0.58       0      275      2.30
knitr::kable(sc_bank,"simple")
Attribute   Missing Values   Unique Values
---------   --------------   -------------
job                      0              12
marital                  0               3
education                0               4
default                  0               2
housing                  0               2
loan                     0               2
contact                  0               3
month                    0              12
poutcome                 0               4
y                        0               2
bank = bank %>% mutate(job = as.factor(job),
                       marital = as.factor(marital),
                       education= as.factor(education),
                       default = as.factor(default), 
                       housing = as.factor(housing), 
                       loan = as.factor(loan),
                       contact = as.factor(contact),
                       month = as.factor(month),
                       poutcome = as.factor(poutcome),
                       y = as.factor(y))
colnames(bank %>% select_if(is.factor))
##  [1] "job"       "marital"   "education" "default"   "housing"   "loan"     
##  [7] "contact"   "month"     "poutcome"  "y"
colnames(bank %>% select_if(is.numeric))
## [1] "age"      "balance"  "day"      "duration" "campaign" "pdays"    "previous"

There are 10 categorical attributes: job, marital, education, default, housing, loan, contact, month, poutcome, and y. There are 7 measures: age, balance, day, duration, campaign, pdays, and previous. The summary tables also show that the data is quite clean. No attribute has missing values, and every attribute has at least two unique values, so none is constant and we do not need to drop any attribute at this point.
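
The same cleanliness checks can be run directly on any data frame. The following is a minimal sketch on a small hypothetical data frame (`toy` and its columns are illustrative and are not the bank data); it counts missing and distinct values per column and flags any column that is constant.

```r
# Hypothetical toy data standing in for the bank table.
toy <- data.frame(
  age     = c(30, 45, 52, 38),
  housing = c("yes", "no", "yes", "no"),
  const   = c("a", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# Missing and distinct values per column.
n_missing <- sapply(toy, function(x) sum(is.na(x)))
n_unique  <- sapply(toy, function(x) length(unique(x)))

# A column with a single distinct value carries no information for a model.
constant_cols <- names(n_unique)[n_unique < 2]

print(n_missing)
print(n_unique)
print(constant_cols)
```

In this toy example only `const` is flagged; a binary attribute such as `housing` has two distinct values and is kept.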

4 Univariate analysis visualizations

4.1 Numeric Attributes

First, we plot the distributions of the numeric attributes. The values of balance, pdays, and previous are highly concentrated: most balance values are close to 0, pdays is concentrated at -1, and previous is concentrated at 0. The previous attribute is the number of contacts performed before this campaign for this client, and pdays is the number of days since the client was last contacted in a previous campaign, with -1 meaning the client was not previously contacted. The two attributes are therefore closely related: customers with previous equal to 0 are the same customers with pdays equal to -1, namely those who were never contacted before. This lets us separate the data into customers who were contacted before and customers who were not. Plotting pdays and previous for the previously contacted customers only gives a more informative distribution.
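
The claim that the two "never contacted" markers coincide can be checked with a cross-tabulation before splitting the data. This is a minimal sketch on hypothetical values (the `toy` data frame is illustrative, not the bank data).

```r
# Hypothetical toy data: pdays == -1 marks clients never contacted before.
toy <- data.frame(
  pdays    = c(-1, -1, 92, 180, -1, 7),
  previous = c( 0,  0,  2,   1,  0, 4)
)

# Cross-tabulate the two "never contacted" indicators.
never_by_pdays    <- toy$pdays == -1
never_by_previous <- toy$previous == 0
print(table(never_by_pdays, never_by_previous))

# TRUE when the two indicators agree on every row.
indicators_agree <- all(never_by_pdays == never_by_previous)

# Keep only previously contacted clients for the follow-up plots.
toy_contacted <- toy[!never_by_previous, ]
```

If the off-diagonal cells of the table are empty, filtering on either attribute selects the same customers.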

p1 = ggplot(bank) + geom_bar(aes(x = age))
p2 = ggplot(bank) + geom_bar(aes(x = balance), width = 500)
p3 = ggplot(bank) + geom_bar(aes(x = day))
p4 = ggplot(bank) + geom_bar(aes(x = duration))
p5 = ggplot(bank) + geom_bar(aes(x = campaign))
p6 = ggplot(bank) + geom_bar(aes(x = pdays))
p7 = ggplot(bank) + geom_bar(aes(x = previous))

grid.arrange(p1, p2, p3, p4, p5, p6, p7, nrow=4, top = "Numeric Attributes for all the customers")
## Warning: position_stack requires non-overlapping x intervals

bank_contacted <- bank[bank$previous!=0,]

Now we draw the distribution plot for contacted customers.

p1 = ggplot(bank_contacted) + geom_bar(aes(x = age))
p2 = ggplot(bank_contacted) + geom_bar(aes(x = balance))
p3 = ggplot(bank_contacted) + geom_bar(aes(x = day))
p4 = ggplot(bank_contacted) + geom_bar(aes(x = duration))
p5 = ggplot(bank_contacted) + geom_bar(aes(x = campaign))
p6 = ggplot(bank_contacted) + geom_bar(aes(x = pdays))
p7 = ggplot(bank_contacted) + geom_bar(aes(x = previous))

grid.arrange(p1, p2, p3, p4, p5, p6, p7, nrow=4, top = "Numeric Attributes for contacted customers")

4.2 Categorical Attributes

After plotting the categorical distributions, we found that a large number of customers are blue-collar, management, or technician workers. Most customers have no credit in default. Moreover, most customers were contacted by cellular phone.

p8 = ggplot(bank) + geom_bar(aes(x = job)) + theme(axis.text.x = element_text(angle=20, hjust = 1, size=8))
p9 = ggplot(bank) + geom_bar(aes(x = marital))
p10 = ggplot(bank) + geom_bar(aes(x = education))
p11 = ggplot(bank) + geom_bar(aes(x = default))
p12 = ggplot(bank) + geom_bar(aes(x = housing))
p13 = ggplot(bank) + geom_bar(aes(x = loan))
p14 = ggplot(bank) + geom_bar(aes(x = contact))
p15 = ggplot(bank) + geom_bar(aes(x = month))
p16 = ggplot(bank) + geom_bar(aes(x = poutcome))
p17 = ggplot(bank) + geom_bar(aes(x = y))

grid.arrange(p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, nrow=5, top = "Categorical Attributes")

5 Correlation matrix

From the correlation matrix, we found that the correlation between pdays and previous is quite high. For the remaining attributes the correlations are quite low, which is good because it limits collinearity among the predictors.

library(ggcorrplot)
fullCorrMatrix = round(cor(bank %>% select_if(is.numeric)), 2)
ggcorrplot(fullCorrMatrix, type = "lower", outline.col = "white",lab = TRUE)

6 Bivariate Analysis

6.1 Measure/Measure

There is no linear relationship among most numeric attributes. However, ‘duration’, ‘campaign’ and ‘previous’ tend to be lower when ‘balance’ is higher, and the same pattern appears between ‘duration’ and ‘pdays’.

library(gridExtra)
library(ggcorrplot)

bank <- filter(bank, duration > 0 & pdays < 999)

#balance
gg8 = ggplot(bank) + geom_point(aes(x=`balance`, y = `duration`))
gg9 = ggplot(bank) + geom_point(aes(x=`balance`, y = `campaign`))
gg11 = ggplot(bank) + geom_point(aes(x=`balance`, y = `previous`))

grid.arrange( gg8, gg9,  gg11,nrow=3)

#day
gg12 = ggplot(bank) + geom_point(aes(x=`day`, y = `duration`))
gg14 = ggplot(bank) + geom_point(aes(x=`day`, y = `pdays`))

grid.arrange(gg12, gg14, nrow=2)

#duration
gg17 = ggplot(bank) + geom_point(aes(x=`duration`, y = `pdays`))

grid.arrange( gg17)

#campaign
gg19 = ggplot(bank) + geom_point(aes(x=`campaign`, y = `pdays`))

grid.arrange(gg19)

6.2 Category/ Category

The last contact month of the year is related to the client’s job. Management accounts for a large proportion of contacts in each month. Job is also related to education level.

#job by Category
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = job), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = job), position = "fill") + labs(y = "Percent")

grid.arrange(g3, g8,  nrow=2, top = "job by Category")

### education by Category

g1 = ggplot(bank) + geom_bar(aes(x=job, fill =education), position = "fill") + labs(y = "Percent")

grid.arrange(g1, nrow=1, top = "education by Category")

No apparent relationship or unexpected pattern appears between the other pairs of categorical attributes.

6.3 Category/ Measure

Looking at the distributions of the measures across category values, nothing in this dataset is particularly surprising or unexpected.

6.3.1 pdays by category

When the outcome of the previous marketing campaign is success, the number of days since the client was last contacted tends to be smaller than for clients whose previous outcome is failure or other.

cm18 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = pdays)) + theme(axis.title.y = element_blank())

grid.arrange(cm18, nrow=1, top = "pdays by Category")

6.3.2 age by category

There is some association between age and both marital status and contact method.

cm51 = ggplot(bank) + geom_boxplot(aes(x=marital, y = age)) + theme(axis.title.y = element_blank())
cm56 = ggplot(bank) + geom_boxplot(aes(x=contact, y = age)) + theme(axis.title.y = element_blank())
grid.arrange(cm51,cm56, nrow=1, top = "age by Category")

7 Modeling

7.1 Random Forest

# splits 70% of the data selected randomly into training set and the remaining 30% sample into test data set.
train_sub <- sample(nrow(bank),0.7*nrow(bank))
train_bank <-bank[train_sub,] 
test_bank <-bank[-train_sub,]
#pairs(bank)

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
#try several values of 'mtry' (number of variables tried at each split) and compare the OOB error:
rf_bank_train1<-randomForest(y~.,data=train_bank,importance=TRUE, mtry=2, na.action = na.pass)
rf_bank_train1
## 
## Call:
##  randomForest(formula = y ~ ., data = train_bank, importance = TRUE,      mtry = 2, na.action = na.pass) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 9.57%
## Confusion matrix:
##        no  yes class.error
## no  27539  440  0.01572608
## yes  2589 1077  0.70621931
rf_bank_train2<-randomForest(y~.,data=train_bank,importance=TRUE, mtry=3, na.action = na.pass)
rf_bank_train2
## 
## Call:
##  randomForest(formula = y ~ ., data = train_bank, importance = TRUE,      mtry = 3, na.action = na.pass) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.07%
## Confusion matrix:
##        no  yes class.error
## no  27138  841  0.03005826
## yes  2030 1636  0.55373704
rf_bank_train3<-randomForest(y~.,data=train_bank,importance=TRUE, mtry=4, na.action = na.pass)
rf_bank_train3
## 
## Call:
##  randomForest(formula = y ~ ., data = train_bank, importance = TRUE,      mtry = 4, na.action = na.pass) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 9.04%
## Confusion matrix:
##        no  yes class.error
## no  27006  973  0.03477608
## yes  1889 1777  0.51527550

From the results, mtry = 4 gives the lowest OOB estimate of the error rate, 9.04%.
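
Instead of fitting each value of mtry by hand, the comparison can be written as a loop that collects the OOB error for each candidate. The following is a minimal sketch, assuming the randomForest package is installed; it uses the built-in iris data as a stand-in (not the bank data), and the grid, seed and number of trees are illustrative choices.

```r
library(randomForest)

set.seed(42)  # arbitrary seed so the sketch is repeatable

# iris is used only as a stand-in data set; it has four predictors.
mtry_grid <- 1:3
oob_error <- vapply(mtry_grid, function(m) {
  fit <- randomForest(Species ~ ., data = iris, mtry = m, ntree = 200)
  # The last row of err.rate holds the OOB error after all trees are grown.
  unname(fit$err.rate[nrow(fit$err.rate), "OOB"])
}, numeric(1))

results <- data.frame(mtry = mtry_grid, oob_error = oob_error)
print(results)

# Candidate with the lowest OOB error (first one in case of ties).
best_mtry <- results$mtry[which.min(results$oob_error)]
```

Because OOB error varies from run to run, differences of a few hundredths of a percentage point between candidates should not be over-interpreted.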

#plot the number of trees
plot(rf_bank_train3)

#the importance of variables 
rf_bank_train3$importance
##                      no           yes MeanDecreaseAccuracy MeanDecreaseGini
## age        7.496521e-03  0.0060447612         7.326765e-03        569.24017
## job        3.893014e-03 -0.0023008840         3.175810e-03        436.47141
## marital    5.009627e-04  0.0052322811         1.048044e-03        125.44960
## education  1.217041e-03  0.0003085425         1.111189e-03        162.55752
## default   -2.461212e-06  0.0002622862         2.825892e-05         10.06423
## balance    7.890527e-04  0.0045177634         1.220080e-03        609.06266
## housing    5.960913e-03  0.0106493644         6.501830e-03        122.25261
## loan       7.926129e-05  0.0029164460         4.078979e-04         49.86678
## contact    3.702761e-02  0.0037847642         3.318247e-02        114.73747
## day        1.851379e-02  0.0023029936         1.664114e-02        511.99858
## month      7.063417e-02  0.0196806179         6.474106e-02        754.47495
## duration   2.077674e-02  0.1963493248         4.108814e-02       1783.61650
## campaign   1.690049e-03  0.0034100508         1.887987e-03        226.13347
## pdays      2.269822e-02  0.0227813053         2.270797e-02        273.55473
## previous   1.381749e-02  0.0084270203         1.319319e-02        131.79404
## poutcome   1.988401e-02  0.0222062205         2.015739e-02        454.84322
#plot the importance
varImpPlot(rf_bank_train3, main = "variable importance")

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
#Predicting in test data set
Predict_rf <- predict(rf_bank_train3, newdata=test_bank, type = "class")

rf_cf <- caret::confusionMatrix(as.factor(Predict_rf),as.factor(test_bank$y) )
rf_cf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11514   854
##        yes   426   769
##                                           
##                Accuracy : 0.9056          
##                  95% CI : (0.9006, 0.9105)
##     No Information Rate : 0.8803          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4945          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9643          
##             Specificity : 0.4738          
##          Pos Pred Value : 0.9310          
##          Neg Pred Value : 0.6435          
##              Prevalence : 0.8803          
##          Detection Rate : 0.8489          
##    Detection Prevalence : 0.9119          
##       Balanced Accuracy : 0.7191          
##                                           
##        'Positive' Class : no              
## 
#boosting
set.seed(1)
library(gbm)
## Loaded gbm 2.1.8
library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
train_bank$y = ifelse(train_bank$y == "yes",1,0)
bank_gb = gbm(y~.,distribution = "bernoulli",data = train_bank,n.trees = 500,interaction.depth = 4,cv.folds = 3)
summary(bank_gb)
            var         rel.inf
duration    duration    35.3469104
month       month       20.6717401
poutcome    poutcome    13.8235783
job         job          5.6459656
age         age          5.0004557
day         day          4.7525821
balance     balance      3.2943917
contact     contact      3.0841049
pdays       pdays        2.9610691
housing     housing      2.3891585
campaign    campaign     0.7379019
marital     marital      0.7315589
education   education    0.6833697
loan        loan         0.4300640
previous    previous     0.4036700
default     default      0.0434793
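#confusion matrix
set.seed(1)
Predict_rf <- predict(rf_bank_train3, newdata=test_bank)
#predicted probabilities from the boosted model (type = "response" returns probabilities, not log-odds)
yhat_boost =predict(bank_gb, newdata = test_bank, n.trees=500, type = "response")
#classify with a 0.5 cut-off and tabulate the boosting predictions rather than the random forest ones
boost_pred = factor(ifelse(yhat_boost > 0.5, "yes", "no"), levels = c("no", "yes"))
boost_err =table(pred =boost_pred, truth = test_bank$y) 
colnames(boost_err) <- c("No","Yes")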

7.2 Decision Tree

## 70% of the sample size
smp_size <- floor(0.7 * nrow(bank))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(bank)), size = smp_size)

train <- bank[train_ind, ]
test <- bank[-train_ind, ]
library(rpart)
library(rpart.plot)
ct <- rpart.control(xval=10, minsplit=20, cp=0.01)  
cfit <- rpart(y~.,
              data=train, method="class", control=ct,
              parms=list(split="gini")
              )
rpart.plot(cfit, main="Decision Tree")

library(tree)
## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli
summary(tree(y~., data=train, method = "class"))
## 
## Classification tree:
## tree(formula = y ~ ., data = train, method = "class")
## Variables actually used in tree construction:
## [1] "duration" "poutcome" "month"    "contact" 
## Number of terminal nodes:  9 
## Residual mean deviance:  0.4879 = 15440 / 31640 
## Misclassification error rate: 0.1095 = 3465 / 31645
p <- predict(cfit, test ,type="class")
table(p,  test$y)
##      
## p        no   yes
##   no  11706  1013
##   yes   321   523

7.3 Logistic Regression

# splits 70% of the data selected randomly into training set and the remaining 30% sample into test data set.
dt = sort(sample(nrow(bank), nrow(bank)*.7))
train<-bank[dt,]
test<-bank[-dt,]
mylogit <- glm(y ~., data = train, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6779  -0.3733  -0.2505  -0.1481   3.4237  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.353e+00  2.194e-01 -10.724  < 2e-16 ***
## age                -1.364e-03  2.653e-03  -0.514 0.607229    
## jobblue-collar     -3.820e-01  8.702e-02  -4.390 1.13e-05 ***
## jobentrepreneur    -3.622e-01  1.484e-01  -2.440 0.014688 *  
## jobhousemaid       -5.305e-01  1.617e-01  -3.280 0.001037 ** 
## jobmanagement      -1.757e-01  8.724e-02  -2.014 0.044031 *  
## jobretired          3.204e-01  1.152e-01   2.781 0.005426 ** 
## jobself-employed   -3.168e-01  1.336e-01  -2.372 0.017708 *  
## jobservices        -3.503e-01  1.021e-01  -3.430 0.000605 ***
## jobstudent          3.905e-01  1.272e-01   3.071 0.002133 ** 
## jobtechnician      -2.166e-01  8.191e-02  -2.645 0.008180 ** 
## jobunemployed      -2.509e-01  1.326e-01  -1.892 0.058454 .  
## jobunknown         -4.379e-01  2.767e-01  -1.583 0.113518    
## maritalmarried     -1.842e-01  7.072e-02  -2.604 0.009204 ** 
## maritalsingle       7.543e-02  8.066e-02   0.935 0.349673    
## educationsecondary  1.677e-01  7.809e-02   2.147 0.031767 *  
## educationtertiary   3.562e-01  9.031e-02   3.945 7.99e-05 ***
## educationunknown    2.685e-01  1.256e-01   2.137 0.032578 *  
## defaultyes         -2.201e-01  2.063e-01  -1.067 0.285993    
## balance             1.946e-05  5.813e-06   3.349 0.000812 ***
## housingyes         -6.642e-01  5.241e-02 -12.674  < 2e-16 ***
## loanyes            -4.085e-01  7.153e-02  -5.711 1.12e-08 ***
## contacttelephone   -2.517e-01  9.107e-02  -2.764 0.005705 ** 
## contactunknown     -1.652e+00  8.855e-02 -18.658  < 2e-16 ***
## day                 1.227e-02  2.985e-03   4.109 3.97e-05 ***
## monthaug           -7.402e-01  9.266e-02  -7.989 1.37e-15 ***
## monthdec            6.197e-01  2.171e-01   2.854 0.004312 ** 
## monthfeb           -1.274e-01  1.045e-01  -1.218 0.223043    
## monthjan           -1.207e+00  1.406e-01  -8.584  < 2e-16 ***
## monthjul           -9.357e-01  9.257e-02 -10.108  < 2e-16 ***
## monthjun            4.156e-01  1.119e-01   3.714 0.000204 ***
## monthmar            1.516e+00  1.426e-01  10.634  < 2e-16 ***
## monthmay           -4.393e-01  8.566e-02  -5.128 2.93e-07 ***
## monthnov           -9.575e-01  1.004e-01  -9.539  < 2e-16 ***
## monthoct            8.457e-01  1.295e-01   6.529 6.63e-11 ***
## monthsep            7.651e-01  1.455e-01   5.258 1.45e-07 ***
## duration            4.162e-03  7.692e-05  54.107  < 2e-16 ***
## campaign           -8.619e-02  1.196e-02  -7.208 5.67e-13 ***
## pdays              -4.089e-04  3.691e-04  -1.108 0.267949    
## previous            1.012e-02  6.848e-03   1.479 0.139269    
## poutcomeother       2.006e-01  1.079e-01   1.859 0.062979 .  
## poutcomesuccess     2.356e+00  9.868e-02  23.874  < 2e-16 ***
## poutcomeunknown    -1.618e-01  1.115e-01  -1.451 0.146707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22872  on 31644  degrees of freedom
## Residual deviance: 15039  on 31602  degrees of freedom
## AIC: 15125
## 
## Number of Fisher Scoring iterations: 6
library(caret)
glm.probs <- predict(mylogit,test,type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
glm.pred <- as_factor(glm.pred)
confusionMatrix(glm.pred,test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11681  1029
##        yes   303   550
##                                           
##                Accuracy : 0.9018          
##                  95% CI : (0.8967, 0.9068)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 7.089e-12       
##                                           
##                   Kappa : 0.4036          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9747          
##             Specificity : 0.3483          
##          Pos Pred Value : 0.9190          
##          Neg Pred Value : 0.6448          
##              Prevalence : 0.8836          
##          Detection Rate : 0.8612          
##    Detection Prevalence : 0.9371          
##       Balanced Accuracy : 0.6615          
##                                           
##        'Positive' Class : no              
## 
library(ROCR)
pr <- prediction(glm.probs, test$y)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

auc <- as.numeric(performance(pr, "auc")@y.values)
auc
## [1] 0.9044215
plot(prf,
     lwd = 3, colorize = TRUE,
     text.adj = c(-0.2, 1.7),
     main = 'ROC Curve')
mtext(paste('auc : ', round(auc, 5)))
abline(0, 1, col = "red", lty = 2)

glm.pred2 <- ifelse(glm.probs > 0.09 , "yes", "no")
glm.pred2 <- as_factor(glm.pred2)
confusionMatrix(glm.pred2,test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9418  186
##        yes 2566 1393
##                                           
##                Accuracy : 0.7971          
##                  95% CI : (0.7902, 0.8038)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4038          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7859          
##             Specificity : 0.8822          
##          Pos Pred Value : 0.9806          
##          Neg Pred Value : 0.3519          
##              Prevalence : 0.8836          
##          Detection Rate : 0.6944          
##    Detection Prevalence : 0.7081          
##       Balanced Accuracy : 0.8340          
##                                           
##        'Positive' Class : no              
## 

7.4 Summary of modeling results and conclusions

For random forest, we tried several values of ‘mtry’ (the number of variables available for splitting at each tree node). The best was 4, with the lowest out-of-bag estimate of the error rate, 9.04%. We find that duration is the most important factor: the longer the contact lasts, the higher the probability that the client subscribes to the term deposit. Month and day are also important factors. The model's test accuracy is 90.56%, which is above the no-information rate of 88.03%, although only about 47% of the customers who actually subscribed are identified. The Kappa is 0.4945, which indicates moderate agreement. We then use gradient boosting to obtain the relative influence plot and the relative influence statistics. These also show that duration, month and poutcome are the three most important variables among all the predictors.
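
The accuracy and Kappa reported by caret can be recomputed by hand from the counts in the random forest test confusion matrix, which makes it easier to see why a high accuracy can coexist with a modest Kappa on imbalanced classes. The following is a short sketch in base R; the variable names are illustrative.

```r
# Test-set counts copied from the random forest confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(11514, 426, 854, 769), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Share of actual subscribers ("yes") that the model recovers.
recall_yes <- cm["yes", "yes"] / sum(cm[, "yes"])

print(round(c(accuracy = accuracy, kappa = cohen_kappa, recall_yes = recall_yes), 4))
```

Because about 88% of test customers did not subscribe, chance agreement is already high, so Kappa and the recall for "yes" say more about the model than accuracy alone.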

For the decision tree, duration is also the most important factor. If the duration is greater than 830 seconds, the customer has a higher probability of subscribing to the term deposit. Moreover, poutcome is also an important factor: even if the contact duration is not long, a customer who subscribed to a term deposit in a previous campaign has a high probability of subscribing again in this campaign.
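
Split rules like the duration threshold can be read from a fitted rpart object as well as from the plot. The following is a minimal sketch, assuming the rpart package is available; it uses the kyphosis data that ships with rpart as a stand-in (not the bank data).

```r
library(rpart)

# kyphosis ships with rpart and is used here only as a stand-in data set.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Variables used at internal nodes; leaves are marked "<leaf>" in the frame.
node_vars <- as.character(fit$frame$var)
used_vars <- unique(node_vars[node_vars != "<leaf>"])
print(used_vars)

# Printing the fit lists each split rule (variable and threshold) per node.
print(fit)
```

The printed rules give the exact threshold at each node, which is how a statement such as "duration greater than 830" can be checked against the model.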

For logistic regression, we first randomly assign 70% of the data to the training set and the remaining 30% to the test set. We then build a logistic model with the glm() function. Judging by the p-values in the summary, duration, month, day, higher education and campaign are highly significant in this model. On the other hand, age, pdays and previous are not statistically significant at the 0.05 level. After building the model, we first make predictions with a threshold of 0.5, predicting "yes" if the probability is greater than the threshold. As the confusion matrix shows, the accuracy is about 90 percent, so the model appears to predict very well. However, when we drew the ROC curve, we found that the optimal threshold on the curve is 0.09. We therefore created a second prediction with this threshold, and its accuracy drops to about 80 percent, which is an interesting result. With a logistic model we can adjust the threshold to suit the problem, and there is always a tradeoff between true positives and true negatives when we do so. In the caret output the positive class is "no", so true positives are customers correctly predicted to decline and true negatives are customers correctly predicted to subscribe. Comparing the two matrices, we observe that when the true positives decrease, the true negatives increase. Since we aim to assess the effectiveness of marketing campaigns and to target customers, we want to maximize the true negatives, that is, the customers who really say yes. So we choose 0.09 as our threshold.
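
The report does not state how the 0.09 threshold was read off the ROC curve. One common criterion, shown here purely as an assumption, is Youden's J statistic, which picks the threshold maximizing the sum of the two class-wise recall rates minus one. The sketch below uses simulated scores (not the bank model) and, unlike the caret output, treats "yes" as the positive class.

```r
set.seed(1)  # arbitrary seed for the simulated scores

# Simulated stand-in for predicted probabilities and true labels (1 = "yes").
n      <- 2000
truth  <- rbinom(n, 1, 0.12)
scores <- plogis(-2.5 + 2 * truth + rnorm(n))

# Evaluate both recall rates on a grid of candidate thresholds.
thresholds <- seq(0.01, 0.99, by = 0.01)
stats <- t(sapply(thresholds, function(th) {
  pred <- as.integer(scores > th)
  tpr  <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # recall for "yes"
  tnr  <- sum(pred == 0 & truth == 0) / sum(truth == 0)  # recall for "no"
  c(threshold = th, tpr = tpr, tnr = tnr, youden = tpr + tnr - 1)
}))

# Threshold with the largest Youden's J.
best <- stats[which.max(stats[, "youden"]), ]
print(round(best, 3))
```

With a rare positive class, the threshold chosen this way is typically well below 0.5, which is consistent with the low cut-off used in the report.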

All our models show high predictive power, driven mainly by duration, the length of the last contact with the customer. We also suggest that the company focus on previous customers who subscribed to a term deposit in an earlier campaign. The ideal target customer is middle-aged with higher education. We also recommend keeping the contact duration as long as possible. From the logistic regression, we find that higher education and longer contact duration are predictive of, and positively associated with, a "yes" subscription, and that 0.09 is a suitable cut-off point for this business problem. From the random forest and boosting models, we see that the client is more likely to subscribe to the term deposit when the outcome of the previous marketing campaign was a success, depending on the month of the last contact, and when the contact lasts longer. Our decision tree model suggests that if the contact duration is higher than 830 seconds, there is a higher probability that the customer subscribes to the term deposit. In the end, we decided to use the logistic regression model, since its predictions are the best for our purpose: it helps the company find more customers who actually subscribe instead of only achieving a high accuracy rate.

8 Full EDA

8.1 Univariate Analysis

8.1.1 Numeric Attributes

p1 = ggplot(bank) + geom_bar(aes(x = age))
p2 = ggplot(bank) + geom_bar(aes(x = balance), width = 500)
p3 = ggplot(bank) + geom_bar(aes(x = day))
p4 = ggplot(bank) + geom_bar(aes(x = duration))
p5 = ggplot(bank) + geom_bar(aes(x = campaign))
p6 = ggplot(bank) + geom_bar(aes(x = pdays))
p7 = ggplot(bank) + geom_bar(aes(x = previous))

grid.arrange(p1, p2, p3, p4, p5, p6, p7, nrow=4, top = "Numeric Attributes for all the customers")
## Warning: position_stack requires non-overlapping x intervals

bank_contacted <- bank[bank$previous!=0,]

Now we draw the distribution plot for contacted customers.

p1 = ggplot(bank_contacted) + geom_bar(aes(x = age))
p2 = ggplot(bank_contacted) + geom_bar(aes(x = balance))
p3 = ggplot(bank_contacted) + geom_bar(aes(x = day))
p4 = ggplot(bank_contacted) + geom_bar(aes(x = duration))
p5 = ggplot(bank_contacted) + geom_bar(aes(x = campaign))
p6 = ggplot(bank_contacted) + geom_bar(aes(x = pdays))
p7 = ggplot(bank_contacted) + geom_bar(aes(x = previous))

grid.arrange(p1, p2, p3, p4, p5, p6, p7, nrow=4, top = "Numeric Attributes for contacted customers")

8.1.2 Categorical Attributes

After plotting the categorical distributions, we found that a large number of customers are blue-collar, management, or technician workers. Most customers have no credit in default. Moreover, most customers were contacted by cellular phone.

p8 = ggplot(bank) + geom_bar(aes(x = job)) + theme(axis.text.x = element_text(angle=20, hjust = 1, size=8))
p9 = ggplot(bank) + geom_bar(aes(x = marital))
p10 = ggplot(bank) + geom_bar(aes(x = education))
p11 = ggplot(bank) + geom_bar(aes(x = default))
p12 = ggplot(bank) + geom_bar(aes(x = housing))
p13 = ggplot(bank) + geom_bar(aes(x = loan))
p14 = ggplot(bank) + geom_bar(aes(x = contact))
p15 = ggplot(bank) + geom_bar(aes(x = month))
p16 = ggplot(bank) + geom_bar(aes(x = poutcome))
p17 = ggplot(bank) + geom_bar(aes(x = y))

grid.arrange(p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, nrow=5, top = "Categorical Attributes")

8.2 Bivariate Analysis

8.2.1 Measure/Measure

library(gridExtra)
library(ggcorrplot)

bank <- filter(bank, duration > 0 & pdays < 999)

#age
gg1 = ggplot(bank) + geom_point(aes(x=`age`, y = `balance`))
gg2 = ggplot(bank) + geom_point(aes(x=`age`, y = `day`))
gg3 = ggplot(bank) + geom_point(aes(x=`age`, y = `duration`))
gg4 = ggplot(bank) + geom_point(aes(x=`age`, y = `campaign`))
gg5 = ggplot(bank) + geom_point(aes(x=`age`, y = `pdays`))
gg6 = ggplot(bank) + geom_point(aes(x=`age`, y = `previous`))

grid.arrange(gg1, gg2, gg3, gg4,gg5,gg6, nrow=3)

#balance
gg7 = ggplot(bank) + geom_point(aes(x=`balance`, y = `day`))
gg8 = ggplot(bank) + geom_point(aes(x=`balance`, y = `duration`))
gg9 = ggplot(bank) + geom_point(aes(x=`balance`, y = `campaign`))
gg10 = ggplot(bank) + geom_point(aes(x=`balance`, y = `pdays`))
gg11 = ggplot(bank) + geom_point(aes(x=`balance`, y = `previous`))

grid.arrange(gg7, gg8, gg9, gg10, gg11,nrow=3)

#day
gg12 = ggplot(bank) + geom_point(aes(x=`day`, y = `duration`))
gg13 = ggplot(bank) + geom_point(aes(x=`day`, y = `campaign`))
gg14 = ggplot(bank) + geom_point(aes(x=`day`, y = `pdays`))
gg15 = ggplot(bank) + geom_point(aes(x=`day`, y = `previous`))

grid.arrange(gg12, gg13, gg14, gg15, nrow=2)

#duration
gg16 = ggplot(bank) + geom_point(aes(x=`duration`, y = `campaign`))
gg17 = ggplot(bank) + geom_point(aes(x=`duration`, y = `pdays`))
gg18 = ggplot(bank) + geom_point(aes(x=`duration`, y = `previous`))

grid.arrange(gg16, gg17, gg18, nrow=2)

#campaign
gg19 = ggplot(bank) + geom_point(aes(x=`campaign`, y = `pdays`))
gg20 = ggplot(bank) + geom_point(aes(x=`campaign`, y = `previous`))

grid.arrange(gg19, gg20,nrow=2)

#pdays
gg21 = ggplot(bank) + geom_point(aes(x=`pdays`, y = `previous`))

grid.arrange(gg21, nrow=1)
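
The pairwise scatter plots can be complemented by a correlation matrix, which summarises every numeric pair in one table. A minimal sketch, assuming a simulated data frame `toy_bank` in place of `bank` (the values are random and carry no meaning). The plotting step uses ggcorrplot, which is loaded above, and is skipped if the package is not installed.

```r
# Toy stand-in for the numeric columns of `bank` (simulated, illustration only)
set.seed(1)
toy_bank <- data.frame(
  age      = round(runif(50, 18, 90)),
  balance  = round(rnorm(50, mean = 1500, sd = 3000)),
  duration = round(rexp(50, rate = 1 / 250))
)

# Pearson correlation between every pair of numeric attributes
corr_mat <- cor(toy_bank, method = "pearson")
print(round(corr_mat, 2))

# Optional heat map of the matrix, drawn only when ggcorrplot is available
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  print(ggcorrplot::ggcorrplot(corr_mat, lab = TRUE))
}
```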

8.2.2 Category/Category

#job by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = job), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = job), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = job), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = job), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = job), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = job), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = job), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = job), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = job), position = "fill") + labs(y = "Percent")

grid.arrange(g2, g3, g4,  nrow=3, top = "job by Category")

grid.arrange(g5, g6, g7,  nrow=3, top = "job by Category")

grid.arrange(g8, g9,  nrow=3, top = "job by Category")

#marital by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =marital), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = marital), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = marital), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = marital), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = marital), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = marital), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = marital), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = marital), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = marital), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g3, g4,  nrow=3, top = "marital by Category")

grid.arrange(g5, g6, g7,  nrow=3, top = "marital by Category")

grid.arrange(g8, g9,  nrow=3, top = "marital by Category")

#education by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =education), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = education), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = education), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = education), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = education), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = education), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = education), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = education), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = education), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g4,  nrow=3, top = "education by Category")

grid.arrange(g5, g6, g7,  nrow=3, top = "education by Category")

grid.arrange(g8, g9,  nrow=3, top = "education by Category")

#default by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =default), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = default), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = default), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = default), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = default), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = default), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = default), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = default), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = default), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g3,  nrow=3, top = "default by Category")

grid.arrange(g5, g6, g7,  nrow=3, top = "default by Category")

grid.arrange(g8, g9,  nrow=3, top = "default by Category")

#housing by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =housing), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = housing), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = housing), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = housing), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = housing), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = housing), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = housing), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = housing), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = housing), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g3,  nrow=3, top = "housing by Category")

grid.arrange(g4, g6, g7,  nrow=3, top = "housing by Category")

grid.arrange(g8, g9,  nrow=3, top = "housing by Category")

#loan by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =loan), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = loan), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = loan), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = loan), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = loan), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = loan), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = loan), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = loan), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = loan), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g3,  nrow=3, top = "loan by Category")

grid.arrange(g4, g5, g7,  nrow=3, top = "loan by Category")

grid.arrange(g8, g9,  nrow=3, top = "loan by Category")

#contact by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =contact), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = contact), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = contact), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = contact), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = contact), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = contact), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = contact), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = contact), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = contact), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g3,  nrow=3, top = "contact by Category")

grid.arrange(g4, g5, g6,  nrow=3, top = "contact by Category")

grid.arrange(g8, g9,  nrow=3, top = "contact by Category")

#month by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =month), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = month), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = month), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = month), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = month), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = month), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = month), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = month), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = month), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g3,  nrow=3, top = "month by Category")

grid.arrange(g4, g5, g6,  nrow=3, top = "month by Category")

grid.arrange(g7, g9,  nrow=3, top = "month by Category")

#poutcome by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =poutcome), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = poutcome), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = poutcome), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = poutcome), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = poutcome), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = poutcome), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = poutcome), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = poutcome), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = poutcome), position = "fill") + labs(y = "Percent")

grid.arrange(g1, g2, g3,  nrow=3, top = "poutcome by Category")

grid.arrange(g4, g5, g6,  nrow=3, top = "poutcome by Category")

grid.arrange(g7, g8,  nrow=3, top = "poutcome by Category")
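
The blocks above repeat the same plot for every pair of categorical attributes. A possible way to generate them in a loop is sketched below. The helper name `fill_bars` and the data frame `toy_bank` are hypothetical and are not part of the original analysis.

```r
library(ggplot2)

# Toy stand-in for `bank` (hypothetical values, illustration only)
toy_bank <- data.frame(
  job     = c("admin.", "technician", "services", "admin.", "technician", "services"),
  marital = c("married", "single", "married", "divorced", "single", "married"),
  housing = c("yes", "no", "yes", "yes", "no", "no")
)

# Build one proportion bar chart per x variable, all filled by `fill_var`.
fill_bars <- function(data, fill_var, x_vars) {
  x_vars <- setdiff(x_vars, fill_var)  # skip the trivial self-comparison
  plots <- lapply(x_vars, function(x_var) {
    ggplot(data) +
      geom_bar(aes(x = .data[[x_var]], fill = .data[[fill_var]]), position = "fill") +
      labs(x = x_var, y = "Percent", fill = fill_var)
  })
  names(plots) <- x_vars
  plots
}

job_plots <- fill_bars(toy_bank, "job", c("job", "marital", "housing"))

# The list can then be arranged as before, for example:
# gridExtra::grid.arrange(grobs = job_plots, nrow = 3, top = "job by Category")
```

Calling the helper once per fill attribute would reproduce the plots in this subsection without redefining g1 to g9 each time.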

8.2.3 Category/Measure

cm10 = ggplot(bank) + geom_boxplot(aes(x=job, y = age)) + theme(axis.title.y = element_blank())
cm11 = ggplot(bank) + geom_boxplot(aes(x=marital, y = age)) + theme(axis.title.y = element_blank())
cm12 = ggplot(bank) + geom_boxplot(aes(x=education, y = age))+ theme(axis.title.y = element_blank())
cm13 = ggplot(bank) + geom_boxplot(aes(x=default, y = age)) + theme(axis.title.y = element_blank())
cm14= ggplot(bank) + geom_boxplot(aes(x=housing, y = age)) + theme(axis.title.y = element_blank())
cm15 = ggplot(bank) + geom_boxplot(aes(x=loan, y = age)) + theme(axis.title.y = element_blank())
cm16 = ggplot(bank) + geom_boxplot(aes(x=contact, y = age)) + theme(axis.title.y = element_blank())
cm17 = ggplot(bank) + geom_boxplot(aes(x=month, y = age)) + theme(axis.title.y = element_blank())
cm18 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = age)) + theme(axis.title.y = element_blank())

grid.arrange(cm10,cm11,cm12,cm13,cm14,cm15,cm16,cm17,cm18, nrow=3, top = "age by Category")

cm20 = ggplot(bank) + geom_boxplot(aes(x=job, y = balance)) + theme(axis.title.y = element_blank())
cm21 = ggplot(bank) + geom_boxplot(aes(x=marital, y = balance)) + theme(axis.title.y = element_blank())
cm22 = ggplot(bank) + geom_boxplot(aes(x=education, y = balance))+ theme(axis.title.y = element_blank())
cm23 = ggplot(bank) + geom_boxplot(aes(x=default, y = balance)) + theme(axis.title.y = element_blank())
cm24= ggplot(bank) + geom_boxplot(aes(x=housing, y = balance)) + theme(axis.title.y = element_blank())
cm25 = ggplot(bank) + geom_boxplot(aes(x=loan, y = balance)) + theme(axis.title.y = element_blank())
cm26 = ggplot(bank) + geom_boxplot(aes(x=contact, y = balance)) + theme(axis.title.y = element_blank())
cm27 = ggplot(bank) + geom_boxplot(aes(x=month, y = balance)) + theme(axis.title.y = element_blank())
cm28 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = balance)) + theme(axis.title.y = element_blank())
grid.arrange(cm20,cm21,cm22,cm23,cm24,cm25,cm26,cm27,cm28, nrow=3, top = "balance by Category")

cm30 = ggplot(bank) + geom_boxplot(aes(x=job, y = duration)) + theme(axis.title.y = element_blank())
cm31 = ggplot(bank) + geom_boxplot(aes(x=marital, y = duration)) + theme(axis.title.y = element_blank())
cm32 = ggplot(bank) + geom_boxplot(aes(x=education, y = duration))+ theme(axis.title.y = element_blank())
cm33 = ggplot(bank) + geom_boxplot(aes(x=default, y = duration)) + theme(axis.title.y = element_blank())
cm34= ggplot(bank) + geom_boxplot(aes(x=housing, y = duration)) + theme(axis.title.y = element_blank())
cm35 = ggplot(bank) + geom_boxplot(aes(x=loan, y = duration)) + theme(axis.title.y = element_blank())
cm36 = ggplot(bank) + geom_boxplot(aes(x=contact, y = duration)) + theme(axis.title.y = element_blank())
cm37 = ggplot(bank) + geom_boxplot(aes(x=month, y = duration)) + theme(axis.title.y = element_blank())
cm38 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = duration)) + theme(axis.title.y = element_blank())
grid.arrange(cm30,cm31,cm32,cm33,cm34,cm35,cm36,cm37,cm38, nrow=3, top = "duration by Category")

cm40 = ggplot(bank) + geom_boxplot(aes(x=job, y = campaign)) + theme(axis.title.y = element_blank())
cm41 = ggplot(bank) + geom_boxplot(aes(x=marital, y = campaign)) + theme(axis.title.y = element_blank())
cm42 = ggplot(bank) + geom_boxplot(aes(x=education, y = campaign))+ theme(axis.title.y = element_blank())
cm43 = ggplot(bank) + geom_boxplot(aes(x=default, y = campaign)) + theme(axis.title.y = element_blank())
cm44= ggplot(bank) + geom_boxplot(aes(x=housing, y = campaign)) + theme(axis.title.y = element_blank())
cm45 = ggplot(bank) + geom_boxplot(aes(x=loan, y = campaign)) + theme(axis.title.y = element_blank())
cm46 = ggplot(bank) + geom_boxplot(aes(x=contact, y = campaign)) + theme(axis.title.y = element_blank())
cm47 = ggplot(bank) + geom_boxplot(aes(x=month, y = campaign)) + theme(axis.title.y = element_blank())
cm48 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = campaign)) + theme(axis.title.y = element_blank())
grid.arrange(cm40,cm41,cm42,cm43,cm44,cm45,cm46,cm47,cm48, nrow=3, top = "campaign by Category")

cm50 = ggplot(bank) + geom_boxplot(aes(x=job, y = pdays)) + theme(axis.title.y = element_blank())
cm51 = ggplot(bank) + geom_boxplot(aes(x=marital, y = pdays)) + theme(axis.title.y = element_blank())
cm52 = ggplot(bank) + geom_boxplot(aes(x=education, y = pdays))+ theme(axis.title.y = element_blank())
cm53 = ggplot(bank) + geom_boxplot(aes(x=default, y = pdays)) + theme(axis.title.y = element_blank())
cm54= ggplot(bank) + geom_boxplot(aes(x=housing, y = pdays)) + theme(axis.title.y = element_blank())
cm55 = ggplot(bank) + geom_boxplot(aes(x=loan, y = pdays)) + theme(axis.title.y = element_blank())
cm56 = ggplot(bank) + geom_boxplot(aes(x=contact, y = pdays)) + theme(axis.title.y = element_blank())
cm57 = ggplot(bank) + geom_boxplot(aes(x=month, y = pdays)) + theme(axis.title.y = element_blank())
cm58 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = pdays)) + theme(axis.title.y = element_blank())
grid.arrange(cm50,cm51,cm52,cm53,cm54,cm55,cm56,cm57,cm58, nrow=3, top = "pdays by Category")

cm60 = ggplot(bank) + geom_boxplot(aes(x=job, y = previous)) + theme(axis.title.y = element_blank())
cm61 = ggplot(bank) + geom_boxplot(aes(x=marital, y = previous)) + theme(axis.title.y = element_blank())
cm62 = ggplot(bank) + geom_boxplot(aes(x=education, y = previous))+ theme(axis.title.y = element_blank())
cm63 = ggplot(bank) + geom_boxplot(aes(x=default, y = previous)) + theme(axis.title.y = element_blank())
cm64= ggplot(bank) + geom_boxplot(aes(x=housing, y = previous)) + theme(axis.title.y = element_blank())
cm65 = ggplot(bank) + geom_boxplot(aes(x=loan, y = previous)) + theme(axis.title.y = element_blank())
cm66 = ggplot(bank) + geom_boxplot(aes(x=contact, y = previous)) + theme(axis.title.y = element_blank())
cm67 = ggplot(bank) + geom_boxplot(aes(x=month, y = previous)) + theme(axis.title.y = element_blank())
cm68 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = previous)) + theme(axis.title.y = element_blank())
grid.arrange(cm60,cm61,cm62,cm63,cm64,cm65,cm66,cm67,cm68, nrow=3, top = "previous by Category")

cm70 = ggplot(bank) + geom_boxplot(aes(x=job, y = day)) + theme(axis.title.y = element_blank())
cm71 = ggplot(bank) + geom_boxplot(aes(x=marital, y = day)) + theme(axis.title.y = element_blank())
cm72 = ggplot(bank) + geom_boxplot(aes(x=education, y = day))+ theme(axis.title.y = element_blank())
cm73 = ggplot(bank) + geom_boxplot(aes(x=default, y = day)) + theme(axis.title.y = element_blank())
cm74= ggplot(bank) + geom_boxplot(aes(x=housing, y = day)) + theme(axis.title.y = element_blank())
cm75 = ggplot(bank) + geom_boxplot(aes(x=loan, y = day)) + theme(axis.title.y = element_blank())
cm76 = ggplot(bank) + geom_boxplot(aes(x=contact, y = day)) + theme(axis.title.y = element_blank())
cm77 = ggplot(bank) + geom_boxplot(aes(x=month, y = day)) + theme(axis.title.y = element_blank())
cm78 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = day)) + theme(axis.title.y = element_blank())
grid.arrange(cm70,cm71,cm72,cm73,cm74,cm75,cm76,cm77,cm78, nrow=3, top = "day by Category")
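
The boxplots can be read alongside the group medians they display. A minimal sketch in base R, assuming a small hypothetical data frame `toy_bank` in place of `bank`. The ages are made up for illustration.

```r
# Toy stand-in for `bank` (hypothetical values, illustration only)
toy_bank <- data.frame(
  job = c("student", "retired", "student", "management", "retired", "management"),
  age = c(22, 67, 25, 41, 71, 39)
)

# Median of the numeric attribute within each category (the centre line of each box)
age_by_job <- aggregate(age ~ job, data = toy_bank, FUN = median)
print(age_by_job)
```

Replacing `age` and `job` with any other measure and category from this subsection gives the corresponding table.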